Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish

Authors

Abstract

In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on the methodology that allows the extraction of challenging examples of paraphrase pairs in their textual context, leading to a dataset potentially more suitable for evaluating the models’ ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. In order to control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from different paraphrase-rich text sources. This allows us to create a challenging dataset including longer and more lexically diverse paraphrases than can be expected from those collected through heuristics. In addition to quality, this also allows us to keep the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus which provides the original document context for the annotated pairs. We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate the challenging nature of the dataset when classifying with the detailed labeling scheme used in the corpus annotation, with accuracy lagging substantially behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task on almost 400M sentences, the results are highly encouraging, with 29–53% of the pairs being ranked in the top 10, depending on the paraphrase type. The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus as well as through the popular HuggingFace datasets under the CC-BY-SA license.

Similar resources

Passivity in Waiting for Godot and Endgame: A Psychoanalytic Reading

This study investigates Samuel Beckett’s Waiting for Godot and Endgame through Lacanian psychoanalysis. It begins by explaining the most important concepts of Lacanian psychoanalysis. The Beckettian characters are studied with regard to their unconscious state rather than their conscious state, as is common in most Beckett studies. According to Lacan, language plays the sole role in ...

Towards Effective Tutorial Feedback for Explanation Questions: A Dataset and Baselines

We propose a new shared task on grading student answers with the goal of enabling well-targeted and flexible feedback in a tutorial dialogue setting. We provide an annotated corpus designed for the purpose, a precise specification for a prediction task and an associated evaluation methodology. The task is feasible but non-trivial, which is demonstrated by creating and comparing three alternative...

Designing and Validating a Textbook Evaluation Questionnaire for Reading Comprehension II and Exploring Its Relationship with Achievement

In any educational program, the most important factor affecting students' success is the textbook (McDonough and Shaw 2003). In fact, the textbook is the heart of English language teaching (Sheldon 1988). Because of the great importance of the textbook as an essential element of language classes, textbooks must be carefully evaluated and selected to avoid any negative effect on students (Litz). This study, by designing a textbook evaluation questionnaire that gives instructors the opportunity for valid evaluation ...

The Innovation of a Statistical Model to Estimate Dependable Rainfall (DR) and Develop It for Determination and Classification of Drought and Wet Years of Iran

Water from precipitation is the source that meets the countless needs of living beings, especially humans, and any decrease in its quantity or quality directly and negatively affects the life of living organisms. Year-to-year fluctuation is one of the fundamental and very important characteristics of Iran's annual precipitation, and its harmful effects are reflected in one way or another in all economic, social, and even political and security spheres. Because the amount of water from precipitation is one of the main components of planning ...

Towards Automatic Construction of Diverse, High-quality Image Dataset

The availability of labeled image datasets has been shown critical for high-level image understanding, which continuously drives the progress of feature designing and models developing. However, constructing labeled image datasets is laborious and monotonous. To eliminate manual annotation, in this work, we propose a novel image dataset construction framework by employing multiple textual metad...

Journal

Journal title: Natural Language Engineering

Year: 2023

ISSN: 1469-8110, 1351-3249

DOI: https://doi.org/10.1017/s1351324923000086